A live tailored achievement testing study was conducted to compare
procedures based on the one- and three-parameter logistic models. Previous
studies investigating the application of these models to achievement
testing have yielded inconclusive results because of methodological
problems. Close scrutiny of these investigations indicated two problems
that apparently contributed to the ambiguous results. One problem was
the procedures by which item calibrations were linked, and the other
problem was in the item selection procedures. This second problem concerned
stepsize, points of entry into the item pools, and information
cutoff levels. The objective of the current study was to compare the
one- and three-parameter logistic models using the improved procedures.

A total of 88 students enrolled in an introductory measurement course
at the University of Missouri-Columbia served as examinees for the study.

A counterbalanced test-retest design was employed, in which there were
two separate test sessions one week apart for each examinee. Comparisons
were based upon (a) test-retest reliability, (b) ability estimates yielded
by the procedures, (c) the information yielded by the procedures, (d)
the number of items the methods administered, (e) goodness of fit of the
models based on mean square deviations, and (f) the correlations of
estimated true scores, based on ability estimates, with an outside criterion.

In addition, an attitude survey was administered after each test session
to determine student attitudes toward the tailored tests. The results
of the study indicated that both tailored tests had higher reliabilities
than a conventional paper-and-pencil test over the same material. The
three-parameter procedure had higher test information than the one-parameter
procedure and the conventional test. Neither procedure yielded satisfactory
content validity. The attitude survey results indicated generally
favorable student attitudes toward tailored testing.

A Successful Application of Latent Trait Theory
to Tailored Achievement Testing


Tailored testing has been proposed as an alternative measurement
technique because of its potential for dealing with some of the major
problems of conventional testing. Conventional testing, in which the
same test items are given to all examinees, often results in test items
of inappropriate difficulty being administered to many examinees. If
test items are too difficult, an examinee may resort to random guessing
or even omission of items, and if the items are not difficult enough,
the test may not be challenging to the examinee. As a result, the
standard error of measurement for conventional tests usually is higher at
the extremes of the ability range, resulting in tests that are most
accurate for examinees of average ability. This restricted range of accuracy
is reflected in lowered test reliabilities.

Other problems, such as time limit pressures and the effects of test
administration differences (Weiss, 1974), may also affect the precision
of measurement of conventional tests. In order to deal with these
problems, tailored testing procedures were developed (Lord, 1970). The purpose
of this report is to describe a successful application of tailored
testing procedures to achievement measurement. First, however, it may be
helpful to discuss both the rationale and primary characteristics of
tailored testing, and earlier attempts at its utilization.

Tailored testing procedures were designed to reduce the errors of
measurement when estimating an examinee's ability or level of
achievement by attempting to administer to each examinee only items of
appropriate difficulty. This is accomplished by selecting for administration
items that maximize the information about an examinee's estimated ability
level. That is, each examinee receives a test which is "tailored" to
his ability level. This tailoring hopefully results in increased precision
of measurement.

The implementation of tailored testing procedures usually requires
computer capabilities. One reason a computer is needed is that tailored
testing is often based on item characteristic curve (ICC) theory (Lord,
1952; Lord and Novick, 1968). ICC theory involves mathematical models
of sufficient sophistication as to require the use of a computer for
parameter estimation. One of the first requirements for tailored testing is
a precalibrated pool of items from which test items can be selected for
administration. The calibration of the item pool is usually accomplished
by using one of several existing calibration programs (Wright and
Panchapakesan, 1969; Wood, Wingersky, and Lord, 1976; and Urry, 1975)
on conventional test item response data in order to obtain item parameter
estimates for the one-parameter or three-parameter models.


Another step which requires computer capabilities is the operation
of the tailored testing procedures on an interactive basis with the examinee.
This tailored testing program is controlled by a number of program
parameters, such as the point of entry into the item pool, the procedure for
estimating ability (usually either a Bayesian or maximum likelihood
technique), the item selection method, and a rule for terminating the test.

Once the item pool has been created and the procedures implemented,
there are several problems that may arise. Among these is a possible
lowering of the quality of item calibrations when it is necessary to
link small sample calibrations of several tests in order to create a
sufficiently large item pool. Another problem is the nonconvergence of
the ability estimation procedure, and a third stems from possible
violations of the assumptions of the latent trait models. This last case may
occur when an extension is made from ability testing to the measurement
of achievement. In the research reported here an attempt to solve these
problems will be presented.

There are a number of models available for use in tailored testing,
most of which belong to a class of models referred to as latent trait
models. Within this class are a number of ICC models, also known as Item
Response Theory (IRT) models. The particular models chosen for this study
are described below.


Latent  Trait  Models 

The Rasch (1960), or one-parameter logistic (1PL) model, as described
by Wright (1977), requires one ability parameter, $\theta_j$, for each examinee,
and one difficulty parameter, $b_i$, for each item in order to describe the
interaction of an examinee and an item. In exponential form the 1PL
model is given by

$$P(u_{ij}) = \frac{\exp\left(u_{ij}(\theta_j - b_i)\right)}{1 + \exp(\theta_j - b_i)} \tag{1}$$

where $u_{ij}$ is the score (0 or 1) on Item i for Examinee j, $\theta_j$ and $b_i$ are
as defined above, and $P(u_{ij})$ is the probability that $u_{ij}$ is 0 or 1.
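
As a minimal sketch of the 1PL model above, the following Python function evaluates the probability of a scored response for one examinee and one item. The function name `p_1pl` and the numeric values are illustrative assumptions and do not come from the study.

```python
import math

def p_1pl(u, theta, b):
    """Probability of score u (0 or 1) under the 1PL model.

    theta is the examinee ability and b is the item difficulty.
    """
    if u not in (0, 1):
        raise ValueError("u must be 0 or 1")
    return math.exp(u * (theta - b)) / (1.0 + math.exp(theta - b))

# Hypothetical values: an examinee whose ability equals the item difficulty.
print(p_1pl(1, 0.0, 0.0))
# The probabilities of the two possible scores sum to one.
print(p_1pl(1, 1.0, 0.0) + p_1pl(0, 1.0, 0.0))
```

When ability equals difficulty, the exponent is zero and a correct response is as likely as an incorrect one.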

The three-parameter logistic (3PL) model as presented by Birnbaum
(1968) requires three parameters for each item. As in the 1PL model,
the 3PL model requires one ability parameter for each examinee. The 3PL
model is given by

$$P_i(\theta_j) = c_i + (1 - c_i)\,\frac{\exp\left(D a_i(\theta_j - b_i)\right)}{1 + \exp\left(D a_i(\theta_j - b_i)\right)} \tag{2}$$

where $\theta_j$ and $b_i$ are as defined above, $a_i$ is the item discrimination
parameter, $c_i$ is the item guessing parameter, and D is a scaling constant equal
to 1.7.
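
The 3PL response function can be sketched in the same way. The function name `p_3pl` and the item values below are hypothetical; only the constant D = 1.7 is taken from the text.

```python
import math

D = 1.7  # scaling constant stated in the text

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    a is discrimination, b is difficulty, c is the guessing parameter.
    exp(z) / (1 + exp(z)) is written as 1 / (1 + exp(-z)).
    """
    z = D * a * (theta - b)
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Hypothetical item: when ability equals difficulty the probability is
# halfway between the guessing level c and 1.
print(p_3pl(0.0, 1.0, 0.0, 0.18))
```

As ability falls far below the item difficulty, the probability approaches c rather than zero, which is what distinguishes this model from the 1PL model.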

Both these models assume that the items are dichotomously scored,
and that local independence holds. Also, the assumption is made that
the latent trait being measured is unidimensional. (For a full discussion
of the assumptions of these models see Lord and Novick, 1968.) Of
particular significance is the assumption of unidimensionality. When applying
factor analytic methods to ability tests, generally one dominant factor
is found. But achievement tests are usually constructed with a goal of
multidimensional measurement. This multidimensionality requires the
serious consideration of the robustness of the models to the violation
of the unidimensionality assumption when latent trait models are applied
to achievement testing. Before making this examination it will be
helpful to summarize the results of a previous study that used a similar
tailored testing methodology and that demonstrated that tailored
testing procedures could be successfully applied to a unidimensional
vocabulary test (Koch and Reckase, 1978).


Vocabulary  Tailored  Testing  Study 


The purpose of the vocabulary study was to compare the 1PL and 3PL
models in a tailored testing application to vocabulary ability
measurement. The calibration programs used were the MAX program (Wright and
Panchapakesan, 1969) for the 1PL model and the LOGIST program (Wood,
Wingersky, and Lord, 1976) for the 3PL model. Items were selected to
maximize the information function (Birnbaum, 1968) for the maximum
likelihood ability estimate.

The results of this study indicated that, while there were some
problems, either of the two models could be successfully applied to
vocabulary ability measurement. In particular, the reliabilities reported
(a combination of test-retest and equivalent forms reliabilities) were
r = .77 for the 3PL procedure and r = .61 for the 1PL procedure. In
terms of information, the 3PL procedure outperformed the 1PL procedure,
and, at ability estimate levels between -2.0 and +.50, the 3PL
procedure actually yielded greater information than the longer traditional
paper-and-pencil test.

One of the problems encountered in this study was the failure of
the 3PL procedure to converge to ability estimates in nearly one-third
of the cases. When these cases were included in the analyses the 3PL
reliability dropped to r = .36. The hypothesis was put forward that
the cases of nonconvergence occurred because the items in the item pool
were too difficult for many of the examinees.


Tailored  Achievement  Testing 


The vocabulary test in the above study was, of course, an ability
test, and relatively unidimensional (the first factor accounted for 41%
of the variance). The measurement of achievement presents quite a
different problem. The multidimensionality of achievement tests raises the
question of the robustness of ICC theory with respect to the violation
of the unidimensionality assumption.

Very little has been published in the literature dealing with
applications of tailored testing to achievement measurement. In one study
conducted by Bejar, Weiss, and Kingsbury (1977), a biology achievement
test was used, but that test was found to have a very dominant first factor.
Not surprisingly the calibration of the item pool with the ICC model proved
adequate. The use of the ICC model on a one factor achievement test would
not be expected to differ much from use on a unidimensional ability test.

Research reported by Brown and Weiss (1977), in which a tailored
testing procedure was used for an achievement test having several content
areas, indicated that utilizing inter-subtest branching can provide
precision of measurement equal to that of the conventional achievement test.
However, in this study each content area was calibrated separately, rather
than together as a multidimensional item pool. Therefore, even though
tailored testing procedures were applied to a multidimensional
achievement test, the issue of the robustness of the ICC model with respect to
violation of the assumption of unidimensionality was not addressed.

The issue was addressed, however, in a study reported by Koch and
Reckase (1979). In this study achievement tests were not calibrated by
content area, but rather each test was calibrated as a whole. The
achievement tests used were classroom tests from an undergraduate course in
educational measurement. The tests were each calibrated using both the MAX
program (Wright and Panchapakesan, 1969) and the LOGIST program (Wood,
Wingersky, and Lord, 1976), yielding for each test 1PL and 3PL item
parameter estimates. All the tests had items in common, so item calibration
linking was performed using the Least Squares Method (Reckase, 1979) in
order to form a large item pool for tailored testing. Then a
counterbalanced test-retest design was employed, with each examinee taking both
1PL and 3PL tests in each of two sessions. For both the 1PL and 3PL
procedures, items were selected for administration to maximize the value of
the information function (Birnbaum, 1968).

The results of this study indicated a number of problem areas in
applying tailored testing to multidimensional achievement testing. Both
procedures appeared to be inadequate with regard to reliability, with
r = .44 for the 1PL test and r = 0.0 for the 3PL test. In neither case
did test information equal the information yielded by the paper-and-pencil
test, although the 3PL test came substantially closer than did the 1PL
test. Moreover, while the item pool accurately reflected the weighting
of the content areas in the paper-and-pencil exam, the items actually
selected by the two procedures showed significant deviation from the
content distribution of both the item pool and the course exam. It should
be noted here that no branching among content areas was attempted. The
purpose was to see if selecting items on the basis of information alone
would approximate the content area weightings of the item pool.

One other problem that was encountered was nonconvergence of the
3PL maximum likelihood ability estimation in about eight percent of the
cases. Recall that it occurred in almost one-third of the cases in the
vocabulary study previously discussed. The substantial reduction in
nonconvergence cases was attributed to the use of an item pool of more
appropriate difficulty in the achievement testing study.

A number of possible explanations were suggested for the inadequate
performance of the 1PL and 3PL procedures. Among these were unstable
item parameter estimates due to small sample sizes, a compounding of that
instability due to the linking procedures, poor selection of entry points
into the item pool, the possibility that latent trait models may not be
robust with respect to the violation of the assumption of
unidimensionality, and the nonconvergence of the 3PL tailored tests when using
maximum likelihood ability estimation.

It is clear from looking at this study that, when applying tailored
testing to achievement measurement, careful attention must be paid to
the operational characteristics of the procedures. In order to
investigate the robustness of the ICC model with respect to violation of the
unidimensionality assumption, it is first necessary to eliminate problems
such as unstable item calibrations, poor linking procedures, and less
than optimal operational characteristics. The present study is an attempt
to do just that.


Method 


Item  Pool  Construction 


Calibration. The test items that were calibrated for use in the item
pool were obtained from a series of classroom achievement tests
administered in an undergraduate course on educational measurement and
evaluation. Items were taken from six different tests of fifty items each,
covering the content area of educational evaluation techniques. The tests
were calibrated using both the MAX program (Wright and Panchapakesan,
1969), and the LOGIST program (Wood, Wingersky, and Lord, 1976), which
yielded the 1PL and 3PL item parameter estimates, respectively. Sample
sizes ranged from 148 examinees to 316 examinees. The dates of test
administration and sample sizes are presented in Table A-1 of Appendix A.

Linking. It would be quite desirable to have a large sample of
perhaps 1000 examinees to which a single test of 150 items or more could
be administered. This would obviate the need for linking and would
provide more stable item parameter estimates. Unfortunately, it is not often
possible to administer a test to as many as 1000 examinees at one time.
Moreover, for security purposes it is usually necessary to alter a test
between administrations, although there may be numerous items in common
from one administration of a test to the next. Because of this, it is
generally necessary to link together a series of small sample calibrations
to get all the item parameter estimates on the same scale. The linking
is necessary because the item parameter estimates yielded by the latent
trait calibration programs are only invariant to within a linear
transformation due to the arbitrary nature of the zero point and the unit of
measurement defined by the separate calibrations (Reckase, 1979).

The linking of the 1PL "b" values (item difficulty parameter
estimates) was accomplished using the Major Axis Method (Reckase, 1979).
Items in common to the tests to be linked were identified, and for each
test a mean difficulty value was computed for those items in common.
One of the tests was arbitrarily designated as the calibration base, and
a second test calibration was linked to it by adding to each item's
b-value in the second test a scaling constant equal to the difference between
the mean difficulty values that were computed on the common items. The
adding of the constant to the second test difficulty values put them on
the same scale as the calibration base items. At this point the "b" values
for the common items were combined across these two tests using a weighted
average procedure based on the sample sizes of the respective calibrations.
This same procedure was repeated for all of the remaining tests to be
linked using as a calibration base the composite of previously linked
tests.
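
The shift-and-average step described above can be sketched as follows. This is an illustration under stated assumptions, not the program used in the study: the function name, the dictionary layout, the item labels, and the treatment of the composite base as having a single sample size are all hypothetical.

```python
def link_by_common_items(base_b, new_b, n_base, n_new):
    """Put new_b on the scale of base_b using the items the two share.

    base_b, new_b: dicts mapping item id -> difficulty estimate.
    n_base, n_new: calibration sample sizes used as averaging weights.
    Returns (linked difficulties, scaling constant).
    """
    common = sorted(set(base_b) & set(new_b))
    if not common:
        raise ValueError("the two tests share no items")
    mean_base = sum(base_b[i] for i in common) / len(common)
    mean_new = sum(new_b[i] for i in common) / len(common)
    shift = mean_base - mean_new  # scaling constant added to the new test

    linked = dict(base_b)
    for item, b in new_b.items():
        b_scaled = b + shift
        if item in base_b:
            # Common item: weighted average of the two estimates.
            linked[item] = (n_base * base_b[item] + n_new * b_scaled) / (n_base + n_new)
        else:
            linked[item] = b_scaled
    return linked, shift

# Hypothetical difficulty estimates for two overlapping tests.
base = {"q1": 0.0, "q2": 1.0, "q3": -0.5}
new = {"q2": 0.5, "q3": -1.0, "q4": 0.2}
linked, shift = link_by_common_items(base, new, n_base=200, n_new=100)
print(shift, linked)
```

Repeating the call with the returned dictionary as the new base mirrors the sequential linking of the remaining tests.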

The linking of the 3PL calibrations was done using the Maximum
Likelihood Method. This procedure is more fully described by Reckase (1979),
and a brief summary here will suffice. This method required the use of
the LOGIST program in order to simultaneously calibrate the tests. The
test data were first edited into a single large matrix. Items
appearing on Test 1 but not on Test 2 were coded as not reached for Test 2, and
in this way were not used for the calibration of Test 2. The items in
common to the tests ensured that the calibrations were all on the same
scale. The full matrix of responses and not reached codes was analyzed
to obtain the "a", "b", and "c" parameter estimates.
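
The editing of several tests into one matrix with not reached codes might look like the sketch below. The `NOT_REACHED` placeholder, the function name, and the data layout are assumptions made for illustration; the actual code expected by LOGIST is not given in the text.

```python
NOT_REACHED = None  # hypothetical placeholder for the not reached code

def combine_tests(tests):
    """Merge several tests into one response matrix.

    tests: list of (item_ids, rows), where each row holds one examinee's
    0/1 responses in the order of item_ids.
    Returns (all_items, matrix) with one column per distinct item.
    """
    all_items = []
    for item_ids, _ in tests:
        for item in item_ids:
            if item not in all_items:
                all_items.append(item)

    matrix = []
    for item_ids, rows in tests:
        position = {item: k for k, item in enumerate(item_ids)}
        for row in rows:
            # Items absent from this examinee's test are coded not reached.
            matrix.append([row[position[item]] if item in position else NOT_REACHED
                           for item in all_items])
    return all_items, matrix

# Two hypothetical tests that share item "q2".
items, matrix = combine_tests([(["q1", "q2"], [[1, 0]]),
                               (["q2", "q3"], [[1, 1]])])
print(items)
print(matrix)
```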

Item Pool Characteristics. The 1PL and 3PL test procedures used
identical pools of 183 items. Table 1 summarizes the means, standard
deviations, and ranges of the parameter estimates. The correlation
between the respective "b" values was r = .902. Note that the means
and standard deviations of the "b" values for the two calibration
procedures are not directly comparable because the origin and unit of
measurement set by the two calibration programs are not the same.

The distributions of the item parameter estimates are shown in Figures
1-A, 1-B, 1-C, and 1-D. Probably the most disturbing aspect of these
distributions is the positive skewness of the 3PL discrimination values.
Approximately 75 percent of the items had discrimination values below
.75. Figure 1-B shows that the 3PL difficulty values were also positively
skewed. The 3PL item pool did not meet all of the guidelines for item
pools as set out by Urry (1977). These guidelines include: item
discrimination values should be over .8; item difficulty values should be evenly
and widely distributed from about -2.0 to +2.0; guessing values should
be less than .3; and there should be at least 100 items in the pool.
The 1PL difficulty values (shown in Figure 1-D) were much more uniformly
distributed.

Figures 2 and 3 show the information curves for the 1PL and 3PL item
pools, respectively. Again, the 3PL curve is positively skewed, with the
most information being yielded at the lower range of the ability scale.
The 1PL item pool information plot shows a considerably more uniform
curve.

Table 1

Descriptive Statistics of Item Parameter
Estimates for Tailored Testing Item Pools

                      One-Parameter    Three-Parameter
                      Calibration      Calibration
                      b_i              a_i        b_i         c_i

Mean                  -0.030            .610      -1.674      .180
Median                -0.074            .485      -1.764      .180
Standard Deviation     1.396            .484       3.361      .010
Skewness              -0.284           1.517       1.406    -2.536
Low Value             -5.279            .010^b    -9.999^a    .101
High Value             3.052           2.001^c    14.834      .244

Note. Both pools contained 183 items.

^a This value was an arbitrary lower limit on the 3PL difficulty
parameter estimates.

^b This value is an upper limit set by the LOGIST program.


Tailored  Testing  Procedures 

The  procedures  actually  used  for  the  tailored  testing  sessions  have 
been  thoroughly  described  elsewhere  (Koch  and  Reckase,  1978,  1979;  Patience, 
1977),  and  so  only  a  brief  summary  is  given  here. 


Tailored testing procedures have three main components: an item
selection routine, an ability estimation technique, and a stopping rule.
In this study both the 1PL and the 3PL procedures selected items to
maximize the value of the information function (Birnbaum, 1968) at the most
recent ability estimate. For the 1PL testing procedure the formula for
item information is given by


$$I_i(\theta_j) = \frac{\exp[-(\theta_j - b_i)]}{\{1 + \exp[-(\theta_j - b_i)]\}^2} = \psi(\theta_j - b_i) \tag{3}$$

where $I_i(\theta_j)$ is the information for Item i at Ability Level $\theta_j$ for
Examinee j, $\theta_j$ and $b_i$ are as previously defined, and $\psi(x)$ is the logistic
probability density function. For the 3PL testing procedure the formula
for item information is given by

$$I_i(\theta_j) = D^2 a_i^2\,\psi[D L_i(\theta_j)] - D^2 a_i^2\,P_i(\theta_j)\,\psi[D L_i(\theta_j) - \log c_i] \tag{4}$$

where $I_i(\theta_j)$ is the value of the item information at $\theta_j$,
$L_i(\theta_j) = a_i(\theta_j - b_i)$, $P_i(\theta_j)$ is the probability of a correct
response to Item i given Ability $\theta_j$, and $\psi(x)$ and the other parameters
are as defined earlier. The total test information was defined by
Birnbaum (1968) as the sum of the item information values:

$$I(\theta_j) = \sum_{i=1}^{n} I_i(\theta_j) \tag{5}$$

where n is the number of items in the test.

FIGURE 2. INFORMATION CURVE FOR 1PL ITEM POOL

These formulas were used in the tailored testing procedure to compute
the information for each item at the examinee's current ability estimate.
The item with the greatest information at that ability estimate was then
administered to the examinee, with the provision that the information
must be greater than .246 for the 1PL procedure and .65 for the 3PL
procedure. These values were chosen based on other research, since they
minimize errors in estimation. The information cutoffs were different
for the two procedures because the ability scales for the two models are
different. If no item was available with an information value above these
minimums, testing was terminated.

FIGURE 3. INFORMATION CURVE FOR 3PL ITEM POOL
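
The maximum-information selection rule with a cutoff can be sketched as below. The function names, the pool layout, and the item values are hypothetical. For the 3PL case the sketch uses the closed form D²a²(Q/P)[(P - c)/(1 - c)]², which is algebraically equal to the Birnbaum expression for 0 < c < 1 and is used here as a substitute because it also handles c = 0.

```python
import math

D = 1.7  # scaling constant from the 3PL model

def logistic_pdf(x):
    """Logistic density; it is symmetric, so |x| avoids overflow in exp()."""
    e = math.exp(-abs(x))
    return e / (1.0 + e) ** 2

def info_1pl(theta, b):
    return logistic_pdf(theta - b)

def info_3pl(theta, a, b, c):
    logistic = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))  # equals (P - c)/(1 - c)
    p = c + (1.0 - c) * logistic
    return (D * a) ** 2 * ((1.0 - p) / p) * logistic ** 2

def select_item(theta, pool, info_fn, cutoff, used):
    """Return the unused item with the most information above cutoff, or None."""
    best_id, best_info = None, cutoff
    for item_id, params in pool.items():
        if item_id in used:
            continue
        value = info_fn(theta, *params)
        if value > best_info:
            best_id, best_info = item_id, value
    return best_id

# Hypothetical 1PL pool; .246 is the 1PL cutoff reported in the text.
pool = {"easy": (0.0,), "hard": (0.3,)}
print(select_item(0.0, pool, info_1pl, 0.246, used=set()))
print(select_item(0.0, pool, info_1pl, 0.246, used={"easy"}))
```

A return value of `None` corresponds to the termination condition: no remaining item exceeds the cutoff at the current ability estimate. Because 1PL item information cannot exceed .25, a cutoff of .246 admits only items whose difficulty lies close to the current estimate.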

Before testing began no ability estimates were available for the
examinees, so initial estimates were assigned to set the starting points
in the item pool. The initial ability estimates for this study were set
by random assignment to be either -1.856 or -1.500 for the 3PL test, and
to be either -.494 or .496 for the 1PL test. These values represent
difficulty values near the medians of the item pool difficulty
distributions with one on either side of the median. Two different points
were used in order to provide different initial items from one session
to the next. The first item was then selected to maximize information
at the initial ability estimate. If that item was correctly answered
the ability estimate was increased by a fixed stepsize, and if it was
incorrectly answered the ability estimate was decreased by a fixed
stepsize. This fixed stepsize procedure was used until a maximum likelihood
ability estimate, the mode of the likelihood distribution, could be
obtained (i.e., when both correct and incorrect responses were obtained).
The stepsize used for the 1PL procedure was .693, and for the 3PL
procedure it was .4. Each new item was selected to maximize the information
at the new ability estimate, with the restriction that no item could be
used more than once.
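
The provisional-estimate logic can be sketched as follows for the 1PL case. The function names are hypothetical, and the Newton-Raphson solver is an assumption made for illustration; the text states only that maximum likelihood estimation was used, not how the likelihood was maximized.

```python
import math

def step_estimate(theta, correct, stepsize):
    """Fixed-stepsize update used while no maximum likelihood estimate exists."""
    return theta + stepsize if correct else theta - stepsize

def mle_exists(responses):
    """A finite estimate needs at least one correct and one incorrect response."""
    return 0 in responses and 1 in responses

def mle_1pl(responses, difficulties, start=0.0, tol=1e-8, max_iter=50):
    """Newton-Raphson maximum likelihood ability estimate under the 1PL model."""
    theta = start
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        gradient = sum(u - p for u, p in zip(responses, probs))
        curvature = sum(p * (1.0 - p) for p in probs)
        change = gradient / curvature
        theta += change
        if abs(change) < tol:
            break
    return theta

# Start at one of the 1PL entry points with the 1PL stepsize from the text.
theta = step_estimate(-0.494, correct=True, stepsize=0.693)
print(theta)
# Hypothetical responses to two items of equal difficulty.
print(mle_exists([1, 0]), mle_1pl([1, 0], [0.0, 0.0], start=theta))
```

With only correct or only incorrect responses the likelihood has no finite maximum, which is why the fixed stepsize is needed at the start of a test.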

Two stopping rules were used for the testing procedures. The tests
were terminated when there were no items left in the item pool with
information at the current ability estimate greater than the minimum specified
above, or when 20 items had been administered.


Design 

This study employed a counterbalanced design using two sessions
one week apart. Each session included both a test based on the 1PL model
and a test based on the 3PL model. Counterbalancing was achieved by
reversal of the order of presentation of the two tests from one session to the
next. The test-retest design was used to facilitate reliability
comparisons.


During the sessions the tests were administered with no perceptible
break between them. The second test was begun immediately after the final
ability estimate for the first test was obtained. Since both item pools
contained the same items, some of the items in the first test were repeated
in the second test. Therefore, examinees were told that they might receive
the same item more than once. The tailored tests were administered on
Applied Digital Data Systems (ADDS) Consul 980 cathode ray tube terminals
connected to an Amdahl 470/V7 computer via time sharing option facilities.


Sample 

Examinees were volunteers from an undergraduate introductory
measurement course. A total of 88 students participated, 21 male and 67 female.
There were 19 juniors, 67 seniors, and 2 graduate students. The tailored
tests were administered shortly after a classroom test over the same
content. Examinees were told that the tailored test score would be substituted
for the classroom test score if they performed better on the tailored
test, and that they would receive extra credit points for completing the
requirements of the study.




Attitude  Survey 

In addition to taking the tailored tests, each examinee was asked
to fill out an attitude survey at the end of each session. The survey
had 20 items, written in Likert scale format with a five position scale
of response alternatives. The surveys were scored with a one for the
response least favorable toward the tailored test and a five for the
response most favorable.


Analyses 

The research questions in this study included a comparison of
test-retest reliabilities, goodness of fit, content validity, and total test
information functions. In addition, comparisons were made between ability
estimates yielded by the 1PL and 3PL procedures, and between the ability
estimates and outside criteria. Attitudes of the students toward tailored
testing were also determined. Estimated true scores were used in the
computation of all the correlations, based on the suggestion of Lord (1979).

The computation of the estimated true scores was accomplished by
summing the probabilities of correct responses at the examinee's final
ability estimate for all the items in the item pool. The formula for
estimated true scores is as follows:

$$t(\hat{\theta}_j) = \sum_{i=1}^{n} P_i(\hat{\theta}_j) \tag{6}$$

where $t(\hat{\theta}_j)$ is the estimated true score for Examinee j, and n is the
number of items in the item pool.
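
As a rough illustration of the estimated true score, the sketch below sums 1PL probabilities of a correct response over a small pool. The function names and the difficulty values are hypothetical; the same sum applies under the 3PL model with its own response function.

```python
import math

def p_correct_1pl(theta, b):
    """Probability of a correct response under the 1PL model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimated_true_score(theta_hat, difficulties):
    """Sum of correct-response probabilities over every item in the pool."""
    return sum(p_correct_1pl(theta_hat, b) for b in difficulties)

# Hypothetical four-item pool with difficulties equal to the ability estimate:
# each item contributes one half, so the expected number correct is two.
print(estimated_true_score(0.0, [0.0, 0.0, 0.0, 0.0]))
```

Because the sum runs over the whole pool rather than over the items administered, estimated true scores from tests of different lengths fall on a common number-correct scale.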

The reliabilities computed for this study were not strictly
test-retest reliabilities, but rather a mixture of test-retest and equivalent
forms reliabilities since the tests in one session were not identical
to tests taken in the other session. The reliabilities were compared
using a t-test based on Fisher's r to z transformation.

The total test information analyses were done to compare the relative
efficiencies (Birnbaum, 1968) of the tailored testing procedures
with respect to the course exam. The relative efficiency was the ratio
of the information provided by the tailored test at a particular ability
to the information of the traditional paper-and-pencil course exam at
the same ability. Plots were drawn of the relative efficiency curves
for the two tailored testing models based on sample cases selected from
across the entire range of the tailored testing ability estimates.
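The relative efficiency ratio can be sketched in Python as follows. The item information expression is the standard one for logistic models; the function names, the `(a, b, c)` tuple layout, and the scaling constant `D = 1.7` are assumptions for the illustration, not details taken from the study.

```python
import math

def item_information(theta, a, b, c=0.0, D=1.7):
    """Information of one logistic item (a, b, c) at ability theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def total_information(theta, items):
    """Total test information: the sum of the item informations."""
    return sum(item_information(theta, a, b, c) for a, b, c in items)

def relative_efficiency(theta, tailored_items, exam_items):
    """Ratio of tailored test information to course exam information."""
    return total_information(theta, tailored_items) / total_information(theta, exam_items)
```

A value above 1.0 at a given ability means the tailored test provides more information there than the comparison exam, even if it contains fewer items.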

Other analyses run on the data included a series of correlational
analyses. For instance, correlations between the 1PL and 3PL ability
estimates were run using estimated true scores, as were correlations
between the ability estimates and course exam scores. The exams that were
correlated with estimated true scores included the course exam over the
same content area as the tailored tests as well as two other course exams
and the sum of all the course exams. The objective of all these
correlational analyses was to see whether the 1PL and 3PL tests measured
the same thing, and whether one test correlated more highly with the
outside criteria.

The correlations of the tailored test scores and the outside criteria
were an indication of concurrent validity. In addition to the above
analyses, descriptive statistics were compiled, including the average
test lengths, the average test difficulties, and the number of items used
from each item pool, for both sessions of the 1PL and 3PL tests.

The goodness of fit statistic used in this study was the mean square
deviation (MSD), calculated for each examinee by summing over the
administered items the squared differences between the actual responses
to the items and the expected responses to the items (probability of a
correct response) as predicted by the models, and dividing by the number
of items administered. The formula for the MSD statistic is

$$\mathrm{MSD}_j = \frac{\sum_{i=1}^{n_j} \left( u_{ij} - P_i(\hat{\theta}_j) \right)^2}{n_j} \tag{7}$$

where $\mathrm{MSD}_j$ is the mean squared deviation for Examinee j,
$u_{ij}$ is the actual response to Item i by Examinee j,
$P_i(\hat{\theta}_j)$ is the probability of a correct response to Item i by
Examinee j determined from the model using the final ability estimate and
the estimated item parameters, and $n_j$ is the number of items in the
tailored test for Examinee j. The MSD statistic was computed for a
systematic sample of 29 examinees from across the ability range. The 1PL
and 3PL tests were compared using the MSD statistic as the dependent
variable in a dependent t-test.
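The MSD computation and the dependent t-test can be sketched in Python as follows. The function names and the small example values are hypothetical; only the arithmetic follows the definitions given above.

```python
import math

def msd(responses, probabilities):
    """Mean squared deviation between 0/1 responses and model probabilities."""
    if not responses or len(responses) != len(probabilities):
        raise ValueError("need equal-length, non-empty sequences")
    squared = [(u - p) ** 2 for u, p in zip(responses, probabilities)]
    return sum(squared) / len(responses)

def dependent_t(x, y):
    """Paired (dependent) t statistic for two matched lists of values."""
    d = [xi - yi for xi, yi in zip(x, y)]
    n = len(d)
    mean_d = sum(d) / n
    var_d = sum((di - mean_d) ** 2 for di in d) / (n - 1)
    return mean_d / math.sqrt(var_d / n)  # compare with t on n - 1 df

# Hypothetical examinee: three items, two answered correctly.
print(round(msd([1, 0, 1], [0.8, 0.3, 0.6]), 4))
```

In the study each sampled examinee would contribute one MSD value per model, and the two lists of values would be passed to the paired statistic.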

Content validity analyses were done to determine the degree to which
the item pools and the tailored tests accurately represented the content
breakdown of the traditional test. Actual and expected frequencies of
content samplings were compared using a χ² statistic. Since the argument
was presented that achievement tests are typically multidimensional,
factor analyses were also run on the course exam to determine the factor
structure of the test. Principal components analyses with varimax
rotations were employed.
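A comparison of actual and expected content frequencies can be sketched as a Pearson chi-square computation. The report does not state exactly how the expected frequencies were formed, so the version below, which scales reference proportions to the observed total, is an assumption; the function name and the example counts are hypothetical.

```python
def chi_square(observed_counts, expected_proportions):
    """Pearson chi-square of observed counts against reference proportions.

    Expected frequencies are the proportions scaled to the observed total.
    """
    total = sum(observed_counts)
    statistic = 0.0
    for observed, proportion in zip(observed_counts, expected_proportions):
        expected = total * proportion
        statistic += (observed - expected) ** 2 / expected
    return statistic

# Hypothetical two-category example: 15 and 5 items against an even split.
print(chi_square([15, 5], [0.5, 0.5]))
```

The resulting value would be compared with a critical value on (number of categories minus one) degrees of freedom.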

Student attitudes were analyzed using data from the surveys
administered at the end of each session. The first analysis run on the
response data was a principal components factor analysis followed by a
varimax rotation. Once the factor structure was determined, attempts were
made to label factors and compare them with the factors from previous
administrations of the scale reported by Koch and Reckase (1978, 1979).
Coefficient alpha reliabilities were calculated for each factor as well
as for the total scale. Response frequencies for the five scale positions
were tabulated for both sessions to summarize student attitudes toward
tailored testing. Also, multivariate analyses were run to determine
whether there was a significant change in attitudes from one session to
the next.
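Coefficient alpha for a set of survey items can be sketched in Python as follows. The function name and the small response matrix are hypothetical; the formula is the usual one based on item variances and the variance of the total scores.

```python
from statistics import variance

def coefficient_alpha(scores):
    """Coefficient alpha for a list of respondents' item-score lists."""
    k = len(scores[0])  # number of items
    item_variances = [variance(column) for column in zip(*scores)]
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1.0 - sum(item_variances) / total_variance)

# Hypothetical responses of four examinees to three five-point items.
responses = [[4, 5, 4], [2, 3, 2], [3, 3, 4], [5, 4, 5]]
print(round(coefficient_alpha(responses), 3))
```

The function needs at least two items and two respondents, and the total scores must not all be equal.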


Results


Reliability 

Table 2 contains the correlation matrix obtained from intercorrelating
the ability estimates yielded by the two models used in the tailored
testing sessions. The correlation of r = .57 between the ability
estimates from the first 1PL test (1PL 1) and the ability estimates from
the second 1PL test (1PL 2) was the reliability for the 1PL procedure.
The reliability for the 3PL procedure, r = .62, was higher, but not
significantly so. The KR-20 reliability of the traditional
paper-and-pencil course exam was r = .60. The reliabilities of the
tailored tests were actually substantially higher than the reliability of
the conventional test, since normally a KR-20 reliability would be
expected to be higher than a test-retest reliability. Also, it should be
noted that the tailored tests were less than half as long as the
conventional test.

Table 2

Ability Estimate Correlations

Model     Session      1       2       3       4

1. 1PL       1       1.00     .57     .35     .42
2. 1PL       2               1.00     .38     .44
3. 3PL       1                       1.00     .62
4. 3PL       2                               1.00

Table 3 shows that the tailored test reliabilities were even higher
when estimated true scores were used in place of ability estimates. Using
estimated true scores, the 1PL reliability was r = .62 and the 3PL
reliability was r = .71.


Table 3

Ability Estimate Correlations Using Estimated True Scores

Model     Session      1       2       3       4

1. 1PL       1       1.00     .62     .36     .44
2. 1PL       2               1.00     .41     .52
3. 3PL       1                       1.00     .71
4. 3PL       2                               1.00

Information 


The relative efficiency comparison of the total test information
for the 1PL and 3PL procedures is shown in Figure 4. The horizontal
broken line represents the relative efficiency of the course exam, which
was used as a standard for comparing the two procedures. It should be
noted that the ability scale for the 1PL model is not the same as the
ability scale for the 3PL model. Thus the plots are not comparable on a
point by point basis. However, an overall visual examination of the plots
of information curves for the two models is still possible.

FIGURE 4

TOTAL TEST INFORMATION COMPARISON OF 1PL AND 3PL TAILORED TESTS

Perhaps the most significant result of this comparison is that the
3PL procedure not only yielded more information than the 1PL procedure,
but in the ability estimate range of -2.0 to +1.5 the 3PL procedure also
yielded more information than the 50 item paper-and-pencil test. It is
important to point out that the 3PL procedure performed best in that range
of ability estimates where most of the examinees were classified, while
the 1PL procedure had its highest relative efficiency at the upper end
of the range of ability estimates, where few examinees were classified.
See Appendix B for the distribution of ability estimates for both the
1PL and 3PL procedures.


Goodness  of  fit 


Table 4 presents the results of the goodness of fit comparison of
the 1PL and 3PL models using the MSD statistic. MSD values were computed
for 29 cases for each model, along with means, standard deviations, and
the results of a dependent t-test analysis of the data. The results of
the t-test indicated that the 3PL model fit the observed responses
significantly better than the 1PL model (p < .001).


Correlational  Analyses 

Tables 5 and 6 show the correlations of the traditional course exam
scores and total course scores (the sum of the course exam scores) with
the tailored testing ability estimates and with the estimated true scores,
respectively. The differences between the correlations of the 1PL and
3PL ability estimates with Exam I were not significant, while the 1PL
correlation was significantly higher than the 3PL correlation with
respect to the total score for the first session (p < .05) but not for the
second. The correlations did not change significantly when estimated
true scores were used instead of ability estimates.

One interesting result that is shown in Table 5 is that the 1PL 1
ability estimates correlated significantly higher with Exam II than with
Exam I (p < .05). Moreover, both the 1PL 1 and the 1PL 2 ability
estimates correlated higher with the total course score than with Exam I
(p < .01 for 1PL 1, p < .05 for 1PL 2). Remember that Exam I was the
course exam over the same material as the tailored tests. One possible
explanation for this is that the KR-20 reliabilities of Exam II and the
total course score were higher than the reliability of Exam I. The
reliability of the total course score was computed according to a method
suggested by Lord and Novick (1968, pp. 203-204). These reliabilities are
shown in Table 5.


Table 4

Goodness of Fit Comparison
Using the MSD Statistic

Observation    One-Parameter MSD    Three-Parameter MSD

     1               .1887                 .1103
     2               .1833                 .0142
     3               .1863                 .0832
     4               .2085                 .1894
     5               .2123                 .1226
     6               .2087                 .1394
     7               .1853                 .0349
     8               .2107                 .1137
     9               .2133                 .2273
    10               .2174                 .1216
    11               .1923                 .2405
    12               .2219                 .2515
    13               .2120                 .1826
    14               .2197                 .2171
    15               .2192                 .0728
    16               .2033                 .1712
    17               .2176                 .1984
    18               .2124                 .2024
    19               .2122                 .2305
    20               .2015                 .1616
    21               .2095                 .0457
    22               .1883                 .1309
    23               .2230                 .2107
    24               .1367                 .0235
    25               .2086                 .1751
    26               .2177                 .2281
    27               .2087                 .1330
    28               .2137                 .0994
    29               .2097                 .1693

   Mean              .2049                 .1483
   S.D.              .0425                 .0740

t(28) = 5.082 (p < .001)


Descriptive Statistics

Table 7 presents descriptive statistics for both sessions of the
1PL and the 3PL tailored tests. The mean number of items administered
indicates that the 1PL tests tended to be longer than the 3PL tests, and
that many of the 1PL tests went the maximum of 20 items. The mean
proportion of items answered correctly shows that the 3PL procedure
administered items that were, overall, easier than those items
administered by the 1PL procedure.

An important effect related to the 3PL item discrimination parameter
estimates was that only 25 items from the 183 items in the 3PL item pool
were used by the 3PL testing procedure. On the other hand, the 1PL
procedure used 120 items from the 183 items in the 1PL item pool. Figure
5 shows the information curve for the 25 items that were used from the
3PL item pool. The plot shows that the most information yielded by this
reduced pool was at the lower range of abilities. In fact, for ability
estimates over +2.0 there were virtually no items available with
information above the information cutoff.

Table 5

Correlations of Ability Estimates with Traditional Course Exams

Traditional      KR-20          Tailored Testing Model and Session
Course Exam      Reliability    1PL 1    1PL 2    3PL 1    3PL 2

Exam I*             .60          .42      .49      .39      .42
Exam II             .76          .58      .46      .36      .47
Exam III            .64          .36      .35      .38      .44
Total Score         .75          .68      .63      .45      .52

*Exam I was over the same content area as the tailored tests.

Table 6

Correlations of Estimated True Scores
with Traditional Course Exams

Traditional      Tailored Testing Model and Session
Course Exam      1PL 1    1PL 2    3PL 1    3PL 2

Exam I*           .42      .49      .40      .42
Exam II           .58      .46      .36      .44
Exam III          .37      .33      .40      .44
Total Score       .68      .62      .46      .51

*Exam I was over the same content area as the tailored tests.

Table 7

Tailored Test Descriptive Statistics

                                     One-Parameter           Three-Parameter
                                     Tailored Test           Tailored Test
Variable                          Session 1  Session 2    Session 1  Session 2

Mean # of items administered        19.09      18.11        16.23      15.32
Mean # of items correct             11.07      10.30        12.15      11.71
Mean proportion of items correct      .58        .57          .75        .76
Mean of ability estimates            1.37       1.50         -.53       -.36
S.D. of ability estimates             .67        .92          .74        .83


FIGURE 5

INFORMATION CURVE FOR 25 ITEMS USED FROM 3PL POOL
(horizontal axis: theta)


Content  Validity 

Table 8 shows the results of the content validity analysis for both
tailored testing models. The chi-square test indicated that both the 1PL
and the 3PL item pools accurately reflected the weighting of the content
areas specified in the table of specifications for the paper-and-pencil
course exam (see Table A-2 in Appendix A). However, the number of items
administered by content area for a systematic sample of 21 1PL tailored
tests and 20 3PL tailored tests showed significant lack of fit to both
the item pools and the course exam. Also, the content distributions of
the 1PL and 3PL tailored test items were significantly different. It
should be noted that no attempt was made in the tailored testing
procedures to branch among the content areas. The object was to see if
selecting items for administration on the basis of information alone would
approximate the content area weightings of the item pools and the course
exam.

Attitude  Survey 

Attitude Scale Characteristics. Table 9 shows the varimax rotated
factor loading matrix obtained from a principal components analysis of
the first administration of the attitude scale. There were six factors
present with eigenvalues greater than one, accounting for 62.5 percent
of the variance. A subjective examination of the items loading on each
factor resulted in the following factor labels:

factor I - cathode ray tube (CRT) characteristics
factor II - perceived test performance/test satisfaction
factor III - motivation
factor IV - anxiety
factor V - test pace
factor VI - time pressure/item easiness

The items appearing on the attitude scale are listed in Appendix C.

Table 10 shows the rotated factor loading matrix obtained from the
analysis of the second administration of the attitude scale. This time
there were five factors present with eigenvalues greater than one,
accounting for 62 percent of the variance. After a subjective examination,
these factors were given the following labels:

factor I - perceived test performance/test satisfaction
factor II - motivation
factor III - anxiety/time pressure
factor IV - miscellaneous
factor V - CRT characteristics/item easiness


Factor analysis results obtained from the two attitude scale
administrations differed somewhat. For instance, in the first
administration of the scale, anxiety and time pressure items loaded on
separate factors, while in the second administration they formed a single
factor. Another difference was that in the first administration, item
easiness items loaded with time pressure items, while in the second
administration item easiness items loaded with CRT characteristics items.
Also, in the first administration Item 11 loaded by itself, while in the
second administration it was joined by three other items in a factor of
assorted item types, labelled here as miscellaneous.

A multivariate analysis of variance (MANOVA) was performed to
determine whether the mean scores on each item were different over the two
administrations of the attitude scale. The results of the MANOVA
indicated that there were no significant changes. This implied that,
regardless of the changes in factor structure, student attitudes toward
tailored testing did not change from one administration to the next.

Table 8

Test Items by Content Area for the Course Exam,
Item Pools, and Tailored Tests

                            Course Exam     Items in       Items in      Items in 21 1PL   Items in 20 3PL
                               Items        1PL Pool       3PL Pool      Tailored Tests    Tailored Tests
Content Area               Number    %    Number    %    Number    %     Number    %       Number    %

Anecdotal Records             5    10.0     18     9.8     18     9.8      62    15.3         0      0
Behavioral Objectives         5    10.0     20    10.9     20    10.9      43    10.6        47    15.5
Checklists                    5    10.0     18     9.8     18     9.8      37     9.1        21     6.9
Peer Appraisals               2     4.0      5     2.7      5     2.7       6     1.5        14     4.6
Planning Tests                3     6.0     12     6.6     12     6.6      38     9.4        18     5.9
Rankings                      3     6.0      9     4.9      9     4.9       4     1.0        20     6.6
Ratings                       6    12.0     25    13.7     25    13.7      76    18.7        64    21.1
Selection Items               8    16.0     30    16.4     30    16.4      55    13.6        59    19.5
Self Report                   2     4.0      8     4.4      8     4.4      12     3.0        14     4.6
Supply Items                  5    10.0     20    10.9     20    10.9      32     7.9         9     3.0
Table of Specifications       6    12.0     18     9.8     18     9.8      41    10.1        37    12.2

Total                        50            183            183             406               303

Note. Below are the chi-square values for several comparisons. The critical
value for rejection of adequate fit is χ²(10) > 18.31 at α = .05.

1. Course exam items vs. items in 1PL pool, χ² = 4.431
2. Course exam items vs. 1PL tailored test items, χ² = 55.078
3. 1PL pool items vs. 1PL tailored test items, χ² = 43.139
4. Course exam items vs. items in 3PL pool, χ² = 4.431
5. Course exam items vs. 3PL tailored test items, χ² = 80.878
6. 3PL pool items vs. 3PL tailored test items, χ² = 77.662
7. 1PL tailored test items vs. 3PL tailored test items, χ² = 89.02

Table 9

Principal Components Analysis
Varimax Rotated Factor Pattern for
First Attitude Survey Administration

                                          Factor
Item No.       I             II            III           IV            V             VI

   1         -.06           .24          -.28           .47           .45           .03
   2          .15           .09           .06       [illegible]   [illegible]       .76
   3         -.23           .68          -.12           .01          -.40       [illegible]
   4          .23       [illegible]       .19           .52          -.10           .43
   5         -.08           .11           .78          -.06       [illegible]   [illegible]
   6         -.19           .64       [illegible]       .29           .24           .03
   7          .22       [illegible]       .04           .12          -.14           .12
   8          .71       [illegible]       .08          -.00          -.09           .05
   9      [illegible]      -.10           .26          -.20           .53           .58
  10          .02           .14          -.13           .72       [illegible]   [illegible]
  11          .31          -.04           .13       [illegible]      -.60           .19
  12          .25           .64          -.10          -.03       [illegible]      -.13
  13          .74       [illegible]       .06           .19           .10           .04
  14      [illegible]       .65           .15           .16           .22          -.31
  15          .41       [illegible]      -.09           .62          -.05           .24
  16          .72           .17           .13          -.06          -.20           .23
  17      [illegible]       .78           .30          -.02          -.05           .27
  18         -.31          -.05           .43           .49          -.01          -.10
  19          .18           .04       [illegible]   [illegible]      -.16           .02
  20          .32           .19       [illegible]      -.09           .16          -.00

Note. The underlined values indicate the highest loadings of an item
on a factor. Broken underlines indicate other high loadings.


A comparison of the results of the attitude scale administrations
for this study with results from previous administrations of the scale
indicated several differences. For instance, in the earliest
administration of the scale (Koch and Reckase, 1978) anxiety and time
pressure items loaded on separate factors, while in a subsequent study
(Koch and Reckase, 1979) they formed a single factor. In the present
study, they loaded on separate factors in the first administration, and on
the same factor in the second administration. In both of the earlier
studies perceived test performance and test satisfaction items loaded on
separate factors, while in the present study they formed a single factor
in both administrations.

Table 10

Principal Components Analysis
Varimax Rotated Factor Pattern for
Second Attitude Survey Administration

                                   Factor
Item No.       I             II            III           IV            V

   1          .20          -.40           .14           .57           .05
   2          .03           .25           .66       [illegible]      -.04
   3          .23           .19       [illegible]       .72           .19
   4         -.17           .39           .64       [illegible]       .25
   5          .25           .77       [illegible]      -.08           .08
   6          .58       [illegible]       .15           .43          -.25
   7      [illegible]       .25           .07       [illegible]      -.03
   8      [illegible]       .15           .18           .13           .79
   9         -.03           .34          -.12          -.26       [illegible]
  10          .23          -.27           .75           .27       [illegible]
  11          .26          -.05       [illegible]      -.65           .24
  12          .65           .03          -.10       [illegible]       .40
  13      [illegible]       .00           .72           .07           .45
  14          .51           .02       [illegible]       .69          -.09
  15          .36          -.39           .58       [illegible]       .10
  16          .27           .00       [illegible]      -.14           .64
  17          .83           .20           .17           .11           .03
  18      [illegible]       .59          -.03           .17          -.12
  19          .22           .60           .12          -.11           .24
  20         -.04       [illegible]       .16           .01           .24

Note. The underlined values indicate the highest loading of an item
on a factor. Broken underlines indicate other high loadings.


Two types of reliability measures were computed for the attitude
scale. First, a test-retest reliability coefficient was computed between
the sets of total attitude scores for the two administrations. A value
of r = .71 was obtained for this reliability measure. The second type
of reliability measure calculated for the attitude scale was a coefficient
alpha reliability. Coefficient alpha reliabilities were computed for each
factor and for the total scale for both administrations of the
instrument. The results are shown in Table 11 for the first administration
and in Table 12 for the second administration. Overall these
reliabilities were fairly high. However, for the first administration, the
reliability of the time pressure/item easiness factor was relatively
low. Note that in the second administration these two item types did
not load together. In the second administration the only factor not
having a high reliability coefficient was the miscellaneous factor.

Item discrimination indices were calculated for the items on the
attitude survey by correlating individual item scores with the total
scores for each examinee. These values are shown in Table 13.
Discriminations were relatively constant across the two administrations,
with the exception of Item 10.
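The item discrimination index described above, the correlation of each item score with the total score, can be sketched in Python as follows. The function names and the example responses are hypothetical, and the total here includes the item itself, as the text describes.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def item_total_correlations(scores):
    """Correlate each item column with the respondents' total scores."""
    totals = [sum(row) for row in scores]
    return [pearson_r(list(column), totals) for column in zip(*scores)]

# Hypothetical responses of four examinees to three five-point items.
responses = [[4, 5, 4], [2, 3, 2], [3, 3, 4], [5, 4, 5]]
print([round(r, 2) for r in item_total_correlations(responses)])
```

An item on which every respondent gives the same answer has zero variance, and the correlation is then undefined.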

Table 11

Coefficient Alpha Reliabilities for Attitude
Survey Factors and Total Scale for Session I

      Factor Labels                                    Items                  Coeff. α

I.    CRT Characteristics                              8, 13, 16                 .69
II.   Perceived Test Performance/Test Satisfaction     3, 6, 7, 12, 14, 17       .79
III.  Motivation                                       5, 19, 20                 .66
IV.   Anxiety                                          1, 4, 10, 15, 18          .52
V.    Time Pressure/Item Easiness                      2, 9                      .28
      Total Scale                                      all 20 items              .75

Note. Item 11 loaded on its own factor, so no coefficient α could be
calculated for it alone.


Table 12

Coefficient Alpha Reliabilities for Attitude
Survey Factors and Total Scale for Session II

      Factor Labels                                    Items                  Coeff. α

I.    Perceived Test Performance/Test Satisfaction     6, 7, 12, 17              .77
II.   Motivation                                       5, 18, 19, 20             .66
III.  Anxiety/Time Pressure                            2, 4, 10, 13, 15          .74
IV.   Miscellaneous                                    1, 3, 11, 14              .22
V.    CRT Characteristics/Item Easiness                8, 9, 16                  .55
      Total Scale                                      all 20 items              .77


Attitude Scale Results. Responses obtained from the administration
of the attitude scale are summarized in Table 14. Response percentages
for the five categories for each item are shown for both administrations.

Overall the results of the attitude survey were positive regarding
attitudes toward the tailored testing situation. Examinees indicated
that they felt less time pressure when taking the tailored test than
when taking the conventional test. However, responses indicated a split
over whether the examinees felt that they did well on the tailored test,
and many examinees remained neutral on those items dealing with test
performance. Examinees indicated that they were motivated to do well on
the test, but felt little anxiety or stress. The examinees responded
that they felt comfortable with the CRTs, and that the screens were not
difficult to read. Test items were apparently perceived as neither too
difficult nor too easy, but examinees were split over whether they believed
the tailored tests reflected their true knowledge of the material. No
significant correlations were found between the attitude scores and the
ability estimates.


Table 13

Discrimination Indices for Attitude Scale
Items for Two Test Sessions

Item No.    Session I    Session II

    1          .26          .28
    2          .41          .41
    3          .29          .43
    4          .45          .52
    5          .41          .36
    6          .48          .35
    7          .59          .57
    8          .46          .45
    9          .18          .18
   10          .28          .52
   11          .31          .28
   12          .41          .51
   13          .54          .65
   14          .44          .51
   15          .59          .46
   16          .56          .58
   17          .72          .65
   18          .22          .28
   19          .33          .46
   20          .46          .37


Discussion 


In order to fully understand the results of the research reported
here, the results of three tailored testing studies should be kept in
mind: (a) the application of tailored testing models to a vocabulary
test (Koch and Reckase, 1978), (b) a previous attempt to apply tailored
testing models to achievement testing (Koch and Reckase, 1979), and (c)
the current study. The first study, using the vocabulary test, was
successful, but the success was not surprising, since the vocabulary test
used was highly unidimensional. However, nonconvergence of the ability
estimates was found to be a problem. The high nonconvergence rate was felt
to be due to the inappropriate difficulty of the item pool. When an
attempt was made to apply tailored testing to a multidimensional
achievement test, the nonconvergence problem was reduced through the use
of items of appropriate difficulty, but other problems were encountered
(e.g., low reliabilities), and the attempt at application was
unsuccessful.

Table 14

Attitude Scale Response Percentages for
Item Alternatives over Both Sessions

                    Session 1                     Session 2
Item No.    SA    A    N    D   SD        SA    A    N    D   SD

    1        6   43   16   27    8         5   20   17   39   19
    2       32   45    9   10    3        20   49   15   15    1
    3        5   53   34    6    2         7   60   25    8    0
    4        1    9    6   47   38         1    6    7   49   38
    5       27   48   17    8    0        18   53   15   11    2
    6        1   26   63   10    0         0   39   55    7    0
    7        5   27   26   39    3         1   39   30   31    0
    8       13   23   13   35   17         6   24   10   42   18
    9        6   72   23    0    0         6   73   22    0    0
   10        0   14    8   44   34         3    6    6   47   39
   11       17   53    8   18    3        13   51   10   19    7
   12        8   48   24   20    0         2   40   31   26    1
   13        3    5    3   58   31         1    6    6   52   35
   14        0   15   48   38    0         1   17   50   32    0
   15       38   43   11    7    1        31   58    3    6    2
   16       25   55    9   10    1        23   55   11   10    1
   17        1   38   35   22    5         0   28   40   27    5
   18        1   38   19   31   11         2   35   22   34    7
   19        0    0    5   60   35         0    2    5   66   27
   20        1    1    5   51   42         0    2    9   56   33

Note. SA = Strongly Agree, A = Agree, N = Neutral, D = Disagree, SD =
Strongly Disagree. For a list of the actual items, see Appendix C.


There were indications that the lack of success in this first
achievement testing study might have been due to factors other than the
multidimensional nature of the test, such as the linking procedures used
with the calibrations. The current study, in which improvements were made
in the operational characteristics of the tailored testing procedures
and the linking procedures, demonstrated that tailored testing could be
successfully applied to a multidimensional test, if reliability and
information functions were used as criteria. Indeed, the current study
employed virtually the same item pool as the first tailored achievement
testing study, but the results were quite different. The difference
between these two achievement studies was not in the dimensionality of the
item pool, but in the operational characteristics of the procedures
employed. The changes that were made and their effects will now be
discussed.


Reliability 

A number of changes implemented during the design of the current
study probably contributed to the gain in the tailored test reliabilities
over the previous tailored testing achievement study. One such change
was the improvement of the linking procedures that were employed. The
1PL item parameter estimates were linked using the same method as was
used in the previous studies. However, previously the linking had been
done by hand, while this time computer programs were used to perform the
linking. Therefore, any computational errors that might have occurred
in linking should have been eliminated. For this study the 3PL
calibrations were linked using the Maximum Likelihood Method, rather than
the Least Squares Method that had been used earlier. Again linking was
performed by computer programs instead of by hand. These improvements in
linking provided more accurate item parameter estimates for the items
in the pools.

Another important change was that larger sample sizes were used
for item calibration. Sample sizes used ranged from 148 to 314, with
a mean sample size of 226.5. These were not much larger than the sample
sizes used in previous studies for the 1PL calibrations, but they were
somewhat larger than the sample sizes used previously for the 3PL
calibrations. In the previous tailored achievement testing study the 1PL
sample sizes ranged from 96 to 314, with a mean of 212.82, while 3PL
sample sizes ranged from 97 to 314, with a mean sample size of 195.4.
The larger sample sizes may have yielded more stable parameter estimates
than the previous smaller sample sizes, although Reckase (1977) found
that these sample sizes were still inadequate for the 3PL calibration.

Other important changes were in the procedures used in administering
the tailored tests. For instance, entry points (initial ability
estimates) for the 3PL procedures were set at the difficulty values on
either side of the median of the item pool difficulty distribution. In
earlier studies the entry points were arbitrarily set to be +.5, because
the item pool was assumed to be centered around zero. This was found
not to be the case. By using entry points near the median of the difficulty
distribution, more items were available within the fixed stepsize
in either direction. Also, the fixed stepsize that was used was .4,
rather than the .693 that had previously been used for the 3PL procedure.
This helped to avoid the previously encountered problem of moving
through the item pool too quickly, resulting in premature termination
of the test. These changes in the entry points and fixed stepsize for
the 3PL procedure were important factors in the virtual elimination of
the problem of nonconvergence and, together with the improved calibrations
and linkings, probably accounted for the higher reliabilities of
the tailored tests.
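
The entry-point and stepsize rules described above can be sketched in a
few lines of Python. This is only an illustration under stated assumptions:
the function names, the example item pool, and the rule of moving the
provisional ability estimate up after a correct response and down after an
incorrect one are hypothetical and are not taken from the study's program,
which is not reproduced in this report.

```python
from statistics import median


def entry_points(difficulties):
    """Return the pool difficulties just below and just above the pool median.

    Assumption: "either side of the median" means the nearest difficulty
    strictly below the median and the nearest strictly above it.
    """
    m = median(difficulties)
    below = [b for b in difficulties if b < m]
    above = [b for b in difficulties if b > m]
    if not below or not above:
        raise ValueError("pool needs difficulties on both sides of its median")
    return max(below), min(above)


def fixed_step_update(theta, correct, step=0.4):
    """Move the provisional ability estimate by a fixed step (assumed rule)."""
    return theta + step if correct else theta - step


def items_within_step(difficulties, theta, step=0.4):
    """Count pool items whose difficulty lies within one step of theta."""
    return sum(1 for b in difficulties if abs(b - theta) <= step)


# Hypothetical pool that is NOT centered on zero.
pool = [-1.6, -1.3, -1.1, -0.9, -0.8, -0.6, -0.5, -0.3, 0.2, 0.6]
low_entry, high_entry = entry_points(pool)

# Compare how many items are reachable from a median-based entry point
# with how many are reachable from an arbitrary entry point of +.5.
near_median = items_within_step(pool, low_entry)
near_half = items_within_step(pool, 0.5)
```

In this made-up pool, more items lie within one step of the median-based
entry point than within one step of +.5, which is the kind of effect the
paragraph above attributes to the revised entry points.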


Information 


In looking at the information yielded by the tailored tests it should
be remembered that the tailored tests were less than half the length of
the classroom test. Since total test information was the sum of the
individual item information, a drop in total information would be expected
when considering a shorter test. Despite this, the 1PL tailored test
yielded almost as much information as the classroom test, and the 3PL
tailored test yielded more information than the classroom test over most
of the ability range.




Goodness  of  fit 

The superior fit of the 3PL model indicated that the 3PL tailored
tests demonstrated better "person" fit than did the 1PL tests. It was
no surprise that the three-parameter model fit observed response data
better than the one-parameter model. A model with three parameters has
more flexibility in fitting data than a model with only one parameter.
Such a finding is consistent with the findings of previous studies (Koch
and Reckase, 1978, 1979).


Correlational  Analyses 

In correlating the tailored testing ability estimates with the outside
criterion variables, it was found that the 1PL 1 ability estimates
correlated significantly higher with Exam II than with Exam I. Also,
both the 1PL 1 and 1PL 2 ability estimates correlated significantly higher
with the total course score than with Exam I. This is somewhat surprising,
since Exam I was the course exam over the same content as the tailored
tests. However, this might be explained by examining the reliabilities
of the course exams. The KR-20 reliabilities of Exam II and the total
course score were higher than the KR-20 reliability of Exam I. The lower
reliability of Exam I might be limiting the magnitude of the correlations
that can be obtained using that test. Of course, this would be true for
correlations of Exam I with both the 1PL and 3PL ability estimates. One
reason why this effect appeared with the 1PL ability estimates and not
the 3PL ability estimates might be that, since the 1PL calibrations are
based on the sum of the factors, the 1PL tests might have had factors
in common with Exam II. The 3PL calibrations are based on the dominant
factor, which the 3PL tests would have in common with Exam I but not Exam
II. Any sharing of factors between the 1PL tests and Exam II would have
caused that correlation to be higher than the correlation between the 3PL
ability estimates and Exam II. However, these explanations are only
conjecture, and further studies are needed to determine if these anomalous
results can be replicated.
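
The argument that a less reliable exam limits the correlations obtainable
with it rests on two standard results: the KR-20 formula and the classical
bound that an observed correlation cannot exceed the square root of the
product of the two reliabilities. The sketch below illustrates both. The
response matrix and reliability values are invented for illustration and
are not data from this study, and the function names are hypothetical.

```python
from math import sqrt
from statistics import pvariance


def kr20(responses):
    """KR-20 reliability for a matrix of 0/1 item scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    Uses the population variance of the total scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people  # proportion correct
        pq_sum += p * (1.0 - p)
    return (n_items / (n_items - 1)) * (1.0 - pq_sum / total_var)


def max_observed_correlation(rel_x, rel_y):
    """Upper bound on the correlation between two fallible measures."""
    return sqrt(rel_x * rel_y)


# Invented data: four examinees, three items.
example = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
example_kr20 = kr20(example)

# Invented reliabilities: a criterion with reliability .64 caps the
# correlation with a perfectly reliable measure at .80.
ceiling = max_observed_correlation(0.64, 1.0)
```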


Content  Validity 

The content validity results clearly indicated that, even though
the item pools reflected content area weightings proportionate to the
classroom test, the tailored test item selection procedures did not maintain
these content weightings. For the 3PL procedure this was not surprising.
High item discriminations were not distributed evenly across
content areas and, since the 3PL procedure selected items on information,
those content areas having no highly discriminating items were not represented
at all. Content areas with several high discriminators were weighted
too heavily relative to the table of specifications. This
imbalance in the distribution of item discriminations was probably caused
by the loading of the highly discriminating items on the dominant factor.
Previous research (Reckase, 1977) had indicated that the 3PL model calibrates
items based on the dominant factor in the test, resulting in low discrimination
values for items loading on the remaining factors, while the 1PL
procedure calibrates items based on the sum of the factors. Given these
contrasting tendencies, it is not surprising that the 3PL tailored tests
used only 25 items out of 183, whereas the 1PL tailored tests used 120
items out of 183. This effect is reflected in the low correlations between
the 1PL and 3PL ability estimates shown in Table 2.

For the 1PL procedure, however, item discriminations were assumed
to be equal, so the result was somewhat surprising. A possible explanation
is that content areas are not uniformly distributed across the difficulty
scale. The results indicated that, if content areas were to be
weighted appropriately, some type of intercontent area branching scheme
would have to be employed. An alternative to branching might be to administer
tailored tests over unidimensional subtests and to report a profile
of scores. Of course, this alternative carries with it the problem of
identifying unidimensional subtests, as well as determination of a total
score when one is desired.


Attitude  Survey 

The  attitude  scale  results  were  generally  favorable  toward  tailored 
testing.  However,  there  was  no  evidence  to  indicate  any  interaction  be¬ 
tween  either  student  motivation  or  anxiety  levels  and  student  test  per¬ 
formance.  These  findings  were  consistent  with  the  findings  of  the  previous 
study,  which  found  no  significant  correlation  between  attitudes  of  the 
students  toward  the  tailored  tests  and  their  performance.  It  should  be 
emphasized  that  these  studies  were  performed  using  college  juniors  and 
seniors,  most  of  whom  were  females,  and  the  results  may  not  generalize 
to  other  groups. 

The factor structure of the attitude scale appeared to be unstable.
Not only did a number of items switch factors, but the factors themselves
changed both in number and in their nature. For instance, a number of
items that loaded on separate factors in the first administration of the
scale grouped together in the second administration to form a new factor
that did not occur in the first administration. The items that loaded
on this new factor, labelled miscellaneous, were items that did not appear
to be related at all. One possible reason for the unstable factor structure
of the scale was the small sample size. For a scale of 20 items,
88 is not an adequate number of subjects to obtain a stable structure.
It is interesting to note that when an analysis of the factor structure
of the attitude scale using the scree technique was performed the results
were ambiguous. The plot of eigenvalues by the factors is shown in Appendix
D. The number of factors determined using the eigenvalue-greater-than-one
rule gave probably as good an indication of the number of factors
as that obtained from the scree plot.
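
The eigenvalue-greater-than-one rule mentioned above counts the eigenvalues
of the item correlation matrix that exceed 1. The sketch below shows the
idea using a plain cyclic Jacobi eigenvalue routine from the standard
library only. It is an illustration, not the analysis program used in the
study, and the small correlation matrix is invented.

```python
import math


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                # Rotation angle that zeroes the (p, q) element.
                theta = 0.5 * math.atan2(2.0 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p] = c * akp - s * akq
                    a[k][q] = s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k] = c * apk - s * aqk
                    a[q][k] = s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def kaiser_factor_count(corr):
    """Number of eigenvalues of the correlation matrix greater than 1."""
    return sum(1 for ev in symmetric_eigenvalues(corr) if ev > 1.0)


# Invented correlation matrix: two pairs of items, correlated .5 within
# each pair and uncorrelated across pairs.
corr = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
eigenvalues = symmetric_eigenvalues(corr)
n_factors = kaiser_factor_count(corr)
```

A scree plot would display the same sorted eigenvalues against their rank
and look for the point at which the curve levels off.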


Summary and Conclusions


Past studies indicated that there might be serious problems with
the application of tailored testing to multidimensional achievement testing.
However, there was some evidence that those findings were the result
of poor item calibration, linking procedures, and test administration
procedures. The present study showed that if sufficient attention was
paid to establishing proper operational characteristics, tailored testing
could be successfully applied to multidimensional achievement tests
to the extent that they yielded high reliabilities and information.

The results of this study indicate that tailored test reliabilities
for both the 1PL and 3PL procedures were probably higher than the reliability
of the classroom test. The information yielded by the 1PL test
was almost as high as the classroom test information, and the 3PL test
information was higher than either one. The fit of the two models to
the response data showed that the 3PL model fit the data better than the
1PL model. Neither procedure, however, had adequate content validity.
In summary, these results showed that tailored testing is a viable procedure
for achievement testing, with the exception of content validity,
and that the 3PL model appears to be the model of choice.
APPENDIX A


Table A-1


Administration  Dates  and  Sample  Sizes 
of  Achievement  Tests  Calibrated  for 
Tailored  Testing  Usage 


Date              Sample Size

9-76                  177
2-77, 4-77            314
9-77, 10-77           202
2-78, 4-78            309
9-78, 11-78           209
2-79                  148

Note.  Dates  given  in  month  and  year. 


Table  A-2 

Table  of  Specifications  for  Exam  I 


                                                           Analysis,
                          Knowledge of                     Synthesis,
                          Terms and       Application      and Evaluation
Content Areas             Techniques      of Techniques    of Techniques     Totals

Planning the Test              1               1
Behavioral Objectives          1               2
Table of Specifications        2               2
Anecdotal Records              1               2
Rating Scales                  2               2
Checklists                     1               2
Rankings                       1               1
Peer Appraisals                1               1
Self Reports                   1               1
Selection Items                2               3
Supply Items                   1               2

Totals                        14              19                17              50

Entries surviving from the remaining cells, in their original order,
without row assignment: 5, 6, 5, 3, 2, 2, 8.


APPENDIX B

Figure B-1

Ability Estimate Frequency Distributions
APPENDIX C

Attitude  Survey  Administered  After  Each  Tailored  Testing  Session 

Please  circle  the  response  to  each  statement  below  which  most  nearly  re¬ 
flects  your  feelings  or  attitude. 

1. During the test I was worried about how well I was doing.

   strongly agree   agree   neutral   disagree   strongly disagree

2. I felt less time pressure while taking this computerized test than
   while taking conventional tests.

   strongly agree   agree   neutral   disagree   strongly disagree

3. I felt that many of the items were too difficult for me.

   strongly disagree   disagree   neutral   agree   strongly agree

4. The computer terminal made me feel that I had to answer the items
   as quickly as possible.

   strongly agree   agree   neutral   disagree   strongly disagree

5. I didn't care very much about how well I did on the test.

   strongly disagree   disagree   neutral   agree   strongly agree

6. I think I did well on the test compared to other people.

   strongly agree   agree   neutral   disagree   strongly disagree

7. I felt that my performance on this test reflected my true knowledge
   of A140.

   strongly disagree   disagree   neutral   agree   strongly agree

8. My eyes were uncomfortable when viewing the screen.

   strongly agree   agree   neutral   disagree   strongly disagree

9. I felt that most of the items on this test were too easy.

   strongly disagree   disagree   neutral   agree   strongly agree
10. I was nervous about coming here to take this test.

    strongly agree   agree   neutral   disagree   strongly disagree

11. The pace of the computer was so slow that it made me impatient.

    strongly disagree   disagree   neutral   agree   strongly agree

12. I feel that I did as well on this test as on other tests I've taken.

    strongly agree   agree   neutral   disagree   strongly disagree

13. The computer terminal made me nervous.

    strongly agree   agree   neutral   disagree   strongly disagree

14. I felt confident that I did well on the test.

    strongly disagree   disagree   neutral   agree   strongly agree

15.  I  felt  considerable  stress  while  taking  the  test. 
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